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Abstract 



Context. The huge and still rapidly growing amount of galaxies in modern sky surveys raises the need of an automated 
and objective classification method. Unsupervised learning algorithms are of particular interest, since they discover 
classes automatically. 

Aims. We briefly discuss the pitfalls of oversimplified classification methods and outline an alternative approach called 
" clustering analysis" . 

Methods. We categorise different classification methods according to their capabilities. Based on this categorisation, we 
present a probabilistic classification algorithm that automatically detects the optimal classes preferred by the data. We 
explore the reliability of this algorithm in systematic tests. Using a small sample of bright galaxies from the SDSS, we 
demonstrate the performance of this algorithm in practice. We are able to disentangle the problems of classification 
and parametrisation of galaxy morphologies in this case. 

Results. We give physical arguments that a probabilistic classification scheme is necessary. The algorithm we present 
produces reasonable morphological classes and object-to-class assignments without any prior assumptions. 
Conclusions. There are sophisticated automated classification algorithms that meet all necessary requirements, but a 
lot of work is still needed on the interpretation of the results. 

Key words. Galaxies; Surveys; Methods: data analysis, statistical 



1. Introduction 

Classification of objects is typically the first step towards 
scientific understanding, since it brings order to a previ- 
ously unorganised set of observational data and provides 
standardised terms to describe objects. These standardised 
terms are usually qualitative, but they can also be quanti- 
tative which makes them accessible for mathematical anal- 
ysis. A famous example of a successful classification from 
the field of astrophysics is the Hertzsprung- Russell diagram, 
where stars exhibit distinct groups in the colour-magnitude 
diagram that represent their different evolutionary stages. 
For the same reason, galaxy classification is an important 
conceptual step towards understanding physical properties, 
formation and evolution scenarios of galaxies. 

With the advent of modern sky surveys containing mil- 
lions (e.g. SDSS, COSMOS, PanSTARRS, GAMA) or even 
billions (e.g. LSST) of galaxies, the classification of these 
galaxies is becoming more and more problematic. The vast 
amount of data excludes the hitherto common practice of 
visual classification and clearly calls for an automated clas- 
sification scheme that is more efficient and more objective. 
In this work, we present an algorithm for automated and 
probabilistic classification, where the classes are discovered 
automatically, too. However, the intention of this work is 
not to come up with "yet another morphological classifi- 
cation scheme" , but rather to demonstrate of how it could 
be done alternatively to the standard practice of classifica- 
tion in astrophysics. Besides, we are unable to present a full 



solution to the problem of morphological galaxy classifica- 
tion, since there is still no accepted meth od for parametr is- 



ing a rbitrary galaxy morphologies (cf. Andrae et al. 



m 



prep.). In addition to the lack of convincing classification 



schemes, this is why many experts are very sceptical about 
the subject of classifying galaxy morphologies as a whole. 
As parametrisation of galaxy spectra is more reliable, spec- 
tral classifications have become more accepted. 

In the remaining part of this introduction, we first give 
an overview about modern automated classification meth- 
ods and work out a categorisation of these methods. We 
describe o ur parametrisat ion of galaxy morphologies using 
shapelets (Refregier 2003) in Sect. [2] In Sect. [3] we present 
the alg orithm we are us ing, which has been introduced be- 
fore by |Yu et al. ([2005 ) in the field of pattern recognition. 
We extensively investigate the reliability of this classifica- 
tion algo rithm in Sect. [4 } Such a study has not been under- 
taken by|Yu et al. (2005). In Sect.[5]we present a worked ex- 
ample with a small sample of 1,520 bright galaxies from the 
SDSS. The objects in this sample are selected such that no 
practical problems with parametrisation arise, as we want 
to disentangle the problems of classification and parametri- 
sation as much as possible. The aim of this worked example 
is not to do science with the resulting classes or data-to- 
class assignments, but to demonstrate that such an algo- 
rithm indeed produces reasonable results. We conclude in 
Sect. H 
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Type 


Classification 


Clustering 


Mara 


nearest neighbour, 


i^-means, 






ynopf i"Q 1 nliicforinn' 




discriminant analysis 


kernel PCA 


Soft 


naive Bayes, 


Gaussian mixture 




linear / quadratic 


models 




discriminant analysis, 






neural networks 





Table 1. Overview of different classification and clustering 
algorithms with examples. 

Soft (probabilistic) algorithms are always model-based, 
whereas hard algorithms are not necessarily. Soft algo- 
rithms can always be turned into hard algorithms, but not 
vice- versa. The list of example algorithms is not complete. 



1.1. Overview about classification methods 

In Table [l] we give an overview of different classification 
methods and some example algorithms. The two criteria 
for this categorisation are: 

1. Is the data-to-class assignment probabilistic (soft) or 
not (hard)? 

2. Are the classes specified a priori (classification) or dis- 
covered automatically (clustering)? 

Not all algorithms fit into this categorisation, namely those 
that do not directly assign classes to objects (e.g. self- 
organising maps). 

The algorithm we are going to present is a soft algo- 
rithm, i.e. the data-to-class assignment is probabilistic (cf. 
next section) . The reason is that in the case of galaxy mor- 
phologies, it is obvious that the classes will not be clearly 
separable. We rather expect the galaxies to be more or less 
homogeneously distributed in some parameter space, with 
the classes appearing as local overdensities and exhibi ting 
potentially strong overlap. As we demonstrate in Sect. |4.2[ 
hard algorithms break down in this case, producing biased 
classification results. There are physical reasons to expect 
overlapping classes: First, the random inclination and ori- 
entation angles w.r.t. the line of sight induce a continu- 
ous transition of apparent axis ratios, apparent steepness 
of the radial light profiles and ratio of light coming from 
bulge and disk components. Second, observations of galax- 
ies show that there are indeed transitional objects between 
different morphological types. For instance, there are tran- 
sitional objects between early- and late- type galaxies in the 
" green valley" of the colo ur bimodality (e.g. |Strateva et al. 
2001 Baldry et al.||2004[), wh ich is also reproduced in sim- 
ulations QCroton et al. 2006[ ). Hence, we have to draw the 
conclusion that hard algorithms are generically inappropri- 
ate for analysing galaxy morphologies. This conclusion is 
backed up by practical experience, since even various spe- 
cialis ts usually do not agr ee in hard visual classifications 
(e.g. Bamford et al. 2009). In fact, the outcome of multi- 
person visual classifications becomes a probability distribu- 
tion automatically. 

Furthermore, our algorithm is a clustering algorithm, 
i.e. we do not specify the morphological classes a priori, 
but let the algorithm discover them. This approach is called 
"unsupervised learning" and it is the method of choice if 
we are uncertain about the type of objects we will find in 
a given data sample. In the context of clustering analysis 



classes are referred to as clusters, and we adopt this termi- 
nology in this article. 

1.2. Probabilistic data-to-class assignment 

Let O denote an object and x its parametrisation. 
Furthermore, let Ck denote a single class out of k = 1, . . . , K 
possible classes, then prob(c/e|x) denotes the probability of 
class Ck given the object O represented by x. This condi- 
tional probability prob(ck\x) is called the class posterior 
and is computed using Bayes' theorem 



prob(c k \x) 



prob(c fc )prob(a;|c/ c ) 
prob(cc) 



(i) 



The marginal probability prob(cfc) is called class prior 
and prob(cc|c/c) is called class likelihood. The denominator 
prob(cc) acts as a normalisation factor. The class prio r and 
likelihood are obtained from a generative model (Sect. pT3| ). 
Prior and posterior satisfy the following obvious normali- 
sation constraints 



K 



K 



prob(cfc) = 1 and prob(cfc|a?) = 1 



(2) 



k=l 



k=l 



which ensure that each object is definitely assigned to 
some class. In the case of hard assignments, both poste- 
rior prob(c/c|x) and likelihood prob(a?|c/c) are replaced by 
Kronecker symbols. 



2. Parametrising galaxy morphologies with 
shapelets 

2.1. Basis functions and expansion 

We parametrise galaxy morphologies in terms of shapelets 
( jRefreg ier 2003). Shapelets are a scaled version of two- 
dimensional Gauss-Hermite polynomials and form a set of 
complete basis functions that are orthonormal on the inter- 
val [—oo, oo]. A given galaxy image I(x) can be decomposed 
into a linear superposition of basis functions B m}n (x/ 73), i.e. 



I{x) = ^ c m,nBm,n{x/j3) , 



(3) 



where the c m ^ n denote the expansion coefficients that con- 
tain the morphological information and j3 > denotes a 
scaling radius. In practice, the number of basis functions we 
can use is limited by pixel noise, such that the summation 
in Eq. ([3| stops at a certain maximum order iVmax < 00 
which depends on the object's signal-to-noise ratio and res- 
olution. This means Eq. (|| is an approximation only, 



I(x) 



m,n=0 



>(*//*)■ 



(4) 



We use the C++ algorithm by |Melchior et al] (|2007|) to 
estimate AT max , the scale radius and the linear coefficients, 
which was shown to be faster and more accurate than the 



IDL algorithm by Massey & Refregier (2005). 
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2.2. Problems with shapelet modelling 



3. Soft Clustering Algorithm 



It was shown by Melchior et al.| ( |2009b|_t hat the limitation 
of the number of basis functions in Eq. Q can lead to severe 
modelling failures and misestimations of galaxy shapes in 
case of objects with low signal-to-noise ratios. They identi- 
fied two origins of these biases: First, the Gaussian profile of 
shapelets does not match the true profiles of galaxies, which 
are typically much steeper. Second, the shapelet basis func- 
tions are intrinsically spherical, i.e. they have problems in 
modelling highly eccentric objects. However, in this demon- 
stration we consider only galaxies with high signal-to-noise 
ratios, where we can use many basis functions such that the 
impact of these biases is negligible. We demonstrate this 
in Fig. [I] where we show the shapelet reconstructions of a 
face-on disk, an edge-on disk and a n elli ptical galaxy drawn 
from the sample presented in Sect. 5.1 The reconstruction 
of the face-on disk galaxy (top row) is excellent, leaving es- 
sentially uncorrelated noise in the residuals. However, the 
reconstructions of the edge-on disk galaxy (centre row) and 
the elliptical galaxy (bottom row) exhibit ring-like artefacts 
that originate from the steep light profiles of the ellipti- 
cal and the edge-on disk along the minor axis. Such mod- 
elling failures appear systematically and do not introduce 
additional scatter into the results, i.e. similar galaxies are 
affected in a similar way. However, since shapelet models 
do not capture steep and strongly elliptical galaxies very 
well, we are aware that our algorithm has less dicrimina- 
tory power for galaxies of this kind. 



We no w present the soft clustering algorithm of Yu et al. 
(2005). Before we explain the details, we want to give a 
brief outline of the general method. The basic idea is to 
assign similarities to pairs of objects, so we first explain how 
to measure similarities of galaxy morphologies and what a 
similarity matrix is. These pairwise similarities are then 
interpreted by a probabilistic model, which provides our 
generative model. We also present the algorithm that fits 
the model to the similarity matrix. 



3. 1 . Est i ma ting sim ilarities 

Instead of analysing the data in shapelet space, we compute 
a similarity matrix by assigning similarities to any two data 
points. This approach is an alternative to working directly 
in the sparsely populated shapelet space or employing a 
method for dimensionality reduction. If we have N data 
points cc n , then this similarity matrix will be an N x N 
symmetric matrix. It is this similarity matrix to which we 
are going to apply the soft clustering analysis. 

Based on the pairwise distances in shapelet coefficient 
space (Eq. (|5|), we estimate pairwise similarities up to a 
constant factor as 



Wmn OC 1 



(d(xm,x n )/d 



max J 



(6) 



2.3. Distances in shapelet space 

The coefficients form a vector space and we denote them 
as vectors x. In first-order approximation, these coefficient 
vectors are independent of the size of the object, which 
was encoded by the scale radius /3. Moreover, we can also 
make x invariant against the image flux, since Eq. {5} im- 
plies that for a constant scalar a / the transformation 
x — > ax changes the image flux by this same factor of a. 
Therefore, if we demand x • x = 1, then differing image 
fluxes will have no impact on the shapelet coefficients. This 
implies that morphologies are a direction in shapelet coeffi- 
cient space and the corresponding coefficient vectors lie on 
the surface of a hypersphere with unit radius. We can thus 
measure distances between morphologies on this surface via 
the angle spanned by their (normalised) coefficient vectors, 



d(xi,X2) = < (xi,X2) = arccos {x\ • X2) • 



(5) 



Employing the polar representation of shapelets (Mass ey fc| 
Refregier 2005), we can apply rotations and parity flips to 
shapelet models. We can estimate the object's orientation 
angle from the secon d moments of its light distribution (e.g. 
Melchior et al.|2007[ ) and then use this estimate to align all 
models. This ensures invariance of the coefficients against 
random orientations. Additionally, we can break the de- 
generacy between left- and right-handed morphologies by 
applying parity flips such that the distance of two objects 
is minimised. These transformations in model space do not 
suffer from pixellation errors and increase the local density 
of similar objects in shapelet space. 



Here <imax denotes the maximum distance between any 
two objects in the given data sample, while the exponent 
a > and s > 1 are free parameters that tune the sim- 
ilarity measure. We explain how to choose a and s in 
Sect. |4.3| This definition ensures that < W mn < 1 and 
that the maximum similarities are self-similarities for which 
d(x m ,x m ) = 0. Note that this similarity measure is invari- 
ant under changes of size, flux, orientation, and parity of 
the galaxy morphology. 

3.2. Similarity matrices and weighted undirected graphs 

Square symmetric similarity matrices have a very intu- 
itive interpretation: They represent a weighted undirected 
graph. Figure [2] shows a sketch of such a graph. The data 
points x n are represented symbolically as nodes x n . The 
positions of these nodes are usually arbitrary, it is neither 
necessary nor helpful to arrange them according to the true 
locations of the data points in parameter space. Any two 
data nodes x m and x n are connected by an edge, which 
is assigned a weight W mn . Obviously, all the weights W mn 
form diii N x N matrix W and if this matrix is symmetric, 
i.e. Wm n = W nm , the edges will have no preferred direction. 
In this case, the weighted graph is undirected. In graph the- 
ory the matrix of weights W is called adjacency matrix, and 
we can interpret the similarity matrix as adjacency matrix 
of a weighted undirected graph. 

Inspecting Fig. |2j we now introduce some important 
concepts. First, we note that there is also an edge con- 
necting x\ with itself. This edge is weighted by the "self- 
similarity" W\\. These self-similarities W nn are usually 
non-zero and have to be taken into account in order to 
satisfy normalisation constraints (cf. Eq. (J8|). Second, we 
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Figure 1. Examples of shapelet models of three galaxies from SDSS (g band). 

Shown are the original images (left column), the shapelet models (centre column) and the residuals (right column) of a face-on 
disk galaxy (top row), an edge-on disk galaxy (centre row) and an elliptical galaxy (bottom row). Note the different plot ranges of 
the residual maps. The shapelet decomposition used iVmax = 16, i.e. 153 basis functions. 



define the degree d n of a data node x n as the sum of weights 
of all edges connected with x ni i.e. 

N 

dn=Y. W ^n • (7) 

m=l 

We can interpret the degree d n to measure the connectivity 
of data node x n in the graph. For instance, we can detect 
outlier objects by their low degree, since they are very dis- 
similar to all other objects. Third, we note that we can 
rescale all similarities by a constant scalar factor C > 



without changing the pairwise relations. Hence, we acquire 
the normalisation constraint 

N N 
m,n=l n=l 

This constraint ensures the normalisation of the probabilis- 
tic model we are going to set up for our soft clustering 
analysis of the similarity matrix. 
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15 



12 



14 
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Figure 2. Sketch of a weighted undirected graph. 
The data nodes x n are connected by edges. For the sake of 
visibility, only edges connecting x\ are shown. The edges 
are undirected and weighted by the similarity of the two 
connected nodes. 




Figure 3. Sketch of a bipartite graph. 
The bipartite graph contains two sets of nodes, X — 
{xi, . . . ,xn} and C = {ci, . . . , ck}. Edges connect nodes 
from different sets only and are weighted by an adjacency 
matrix B. Not all edges are shown. 



3.3. Bipartite-graph model 

We need a probabilistic model of the similarity matrix W 
that can be interpreted in terms of a soft clustering a nal- 
ysis. Such a model was proposed by Yu et al. (2005). As 
similarity matrices are closely related to graphs, this model 
is motivated from graph theory, too. The basic idea of this 
model is that the similarity of two data points x m and x n 
is induced by both objects being members of the same clus- 
ters. This is the basic hypothesis of any classification ap- 
proach: Objects from the same class are more similar than 
objects from different classes. 

In detail, we model a weighted undirected graph (Fig. 
|2| and its similarity matrix by a bipartite graph (Fig. [3]). 
A bipartite graph is a graph whose nodes can be divided 
into two disjoint sets X — {xi, ...,xn} of data nodes and 
C — {ci, . . . ,ck} of cluster nodes, such that the edges in 
the graph only connect nodes from different sets. Again, 
the edges are weighted and undirected, where the weights 
B n k form an N x K rectangular matrix B, the bipartite- 



graph adjacency matrix. The bipartite-graph model for the 
similarity matrix then reads 



Wrr 



K R R 

k=l 



(9) 



with the cluster priors = J2n=i Bnk- A detailed deriva- 
tion is given in the following section. This model induces 
the pai rwise similaritie s via two-hop transitions X —> C —> 
X (cf. [Yu et ah 2005). The numerator accounts for the 
strength of the connections of both data nodes to a certain 
cluster. The impact of the denominator is that the com- 
mon membership to a cluster of small degree is considered 
more decisive. Obviously, the model defined by Eq. (|9| is 
symmetric, as the similarity matrix itself. The normalisa- 
tion constraint on W as given by Eq. ([8| translates via the 
bipartite-graph model to 



K N 



K 



Bnk = Yl Xk = 1 



(10) 



k=l n=l 



k=l 



These constraints need to be respected by the fit algorithm. 
Having fitted the bipartite-graph model to the given data 
similarity matrix, we compute the cluster posterior proba- 
bilities 



prob(c k \x n ) 



prob(x n ,c fc ) 
prob(x n ) 



B n k 



(11) 



which are the desired soft data-to-cluster assign- 
ments. Obviously, K cluster posteriors are assigned to 
each data node x„ and the normalisation constraint 



^2k=i P r0D ( c /c|^n) = 1 is satisfied. 



3.4. Mathematical derivation 



Here we give a derivation of the bipartit e-graph mo del of 
Eq. ^ that is more detailed than in|Yu et al.| (|2Q05|>. The 



ansatz is to identify the similarity W mn with the joint prob- 
ability 



prob(x m ,x n ) . 



(12) 



This interprets W mn as the probability to find x m and 
x n in the same cluster. Eq. Q ensures the normalisation 
J2m n=i P r ob(x m , x n ) = 1. As we do not know which par- 
ticular cluster induces the similarity, we have to marginalise 
over all cluster nodes in Fig. [3J 



K 



prob(x m ,x n ) = ^prob(x m ,x n ,c fe ) . 



(13) 



k=l 



With this marginalisation we have switched from the 
weighted undirected graph to our bipartite-graph model. 
Applying Bayes' theorem yields 



prob(x m ,x n ) 



K 

E 

k-i 



prob(x n |c fc )prob(x m ,c/ e ) , 



where we have used 
prob(x n |x m ,c/e) = prob(x n |c/e) , 



(14) 



(15) 
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since x m and x n are not directly connected in the bipar- 
tite graph, i.e. they are statistically independent. This is 
the only assumption in this derivation and it implies that 
all statistical dependence is induced by the clusters. Using 
Bayes' theorem once more yields 



K 



prob(x m ,x n ) ^ 



k=i 



probp n , c fc ) prob(x m , c k ) 
prob(c fc ) 



(16) 



We identify the bipartite-graph adjacency matrix in anal- 
ogy to Eq. p2i, 



B nk = prob(x n ,c fc ) , 
with its marginalisation 

N 



N 



X k = prob(c/e) = ^2 prob(x n , c k ) = ^ B nk 



(17) 



(18) 



The marginalised probabilities X k are the cluster priors of 
the cluster nodes c k in the bipartite graph. Moreover, the 
X k are the degrees of the nodes. 

3.5. Fitting the similarity matrix 

In order to fit the bipartite-graph model denned by Eq. 
|9l to a given similarity matrix, we perform some simpli- 
fications. First, we note that we can rewrite Eq. (|9| using 
matrix notation, 



W = BA~ 1 -B 1 



(19) 



where B is the N xK bipartite-graph adjacency matrix and 
A = diag(Ai, . . . , X k ) is the KxK diagonal matrix of cluster 
degrees. This notation enables us to employ fast and effi- 
cient algorithms from linear algebra. We change variables 

by 



B = H ■ A , 



(20) 



where H is an N x K matrix. The elements of H can be 
interpreted as the cluster likelihoods, since H nk = = 

^p^obCc ^ = P r °k( x ™l c fc)- Using these new variables H 

and A, the model W of the data similarity matrix W is 
given by 



W = H • A • H 1 



(21) 



where we have eliminated the matrix inversion and reduced 
the nonlinearity to some extent. The normalisation con- 
straints of Eq. ( 10 ) translate to H as 



N 



H n k 



N 



prob(x n |c fc ) 



V* 



,K. (22) 



The normalisation constraints on H and A are now decou- 
pled, and we can treat both matrices as independent of each 
other. As H is an N x K matrix and AaKxK diagonal 
matrix, we have K{N + 1) model parameters. In compar- 
ison to this number, we do have ^N(N + 1) independent 
elements in the data similarity matrix due to its symmetry. 
Hence, a reasonable fit situation requires ^> K in order 
to constrain all model parameters. 



The data similarity matrix W is fitted by maximising 
the logarithmi c likelihood f unction log£ of the bipartite- 
graph model. Yu et al. (2005) give a derivation of this func- 
tion based on the theory of random walks on graphs. Their 
result is 



N 



log£(0|WO= ^mn log prob(x |e), 



(23) 



where = {iin, . . . , Hnk-, Ai, . . . , A^} denotes the set 
of K(N + 1) model parameters and prob(x m , x n \Q) = 

Hk=i HmkX k H nk = Wmn ' ls the model. If we remem- 
ber that Wm n — prob(x m , x n ), then we see that log£ 
is the cross entropy of the true probability distribu- 
tion W mn — prob(x m ,x n ) and the modelled distribution 
Wmn prob(x m , x n |0). Consequently, maximising log£ 
maximises the information our model contains about the 
data similarity matrix. 

Directly maximising log C is too hard, since the fit pa- 
rameters are subject to the constraints given by Eqs. (10) 
and (22). We use an alternative approach that makes use 



of the expectation-maximisation (EM) algorithm, which is 
an iterative fit routine. Given an initial guess on the model 
parameters, the EM algorithm provides a set of algebraic 
update equations to compute an improved estimate of the 
optimal parameters that automatically respects the nor- 
malisation. These update equations are ( |Bilmes 1997 Yu 
|eT"aI1[2005 ) 



\new \ 

A k = A k 



N 

E 



w„ 



i (H-K-HT) r 



Hm. k H-n 



N 

H nk W H nk \ k 



W„ 



i {H • A • H T ) r 



-H, 



mk - 



(24) 



(25) 



The H^ w have to be normalised by hand, whereas the 
A^ ew are already normalised. Each iteration step updates 
all the model parameters, which has time complexity G(K • 
N 2 ) for K clusters and N data nodes. We initialise all the 
cluster degrees to = ^, whereby we trivially satisfy 
the normalisation condition and simultaneously ensure that 
no cluster is initialised as virtually absent. The H® k are 
initialised randomly and normalised "by hand". 

Now, we want to briefly discuss the convergenc e proper- 
ties of the EM algorithm. It has been shown (e.g. Redner & 
Walker 1984) that the EM algorithm is guaranteed to con- 
verge to a local maximum of log£ under mild conditions. 
Indeed, it was shown that the EM algorithm is monoton- 
ically converging, i.e. each iteration step is guaranteed to 
increase log C. Therefore, after each iteration step, we check 
how much log£ was increased compared to the previous 
step. If log£ changed by less than a factor of 10 -9 , we will 
consider the EM algorithm to have converged. This conver- 
gence criterion was chosen based on systematic tests like 
those discussed in Sect. [4j Finally, we note that the fit re- 
sults are not unique, since the ordering of the clusters is 
purely random. 



3.6. Estimating the optimal number of clusters 

In this section we demonstrate how to estimate the optimal 
number of clusters for a given data set, which is a crucial 
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both large and small values of pairwise similarities are de- 
cisive. We estimate the optimal K via the position of a kink 
in the function SSR(if) (cf. Fig. E|. Such a kink arises if 
adding a further cluster does not lead to a significant im- 
provement in the similarity-matrix reconstruction. 

We demonstrate this procedure in Fig. [4] by using the 
toy example of Figs. [6] and [7| which is composed of six 
nicely separable clusters. We fit bipartite-graph models to 
the similarity matrix shown in Fig. [7J with K ranging from 
1 to 15. The resulting SSR values are shown in panel (a) 
of Fig. [4j In fact, SSR(if) exhibits two prominent kinks at 
K = 3 and K = 6, rather than a single one. Obviously, 
for K = 3, the clustering algorithm groups the four nearby 
clusters together, thus resulting in three clusters. For K = 
6, it is able to resolve this group of "subclusters". 

We can construct a more quantitative measure by com- 
puting the angular change A(K) of log SSR(if) at each K, 



4 6 8 10 12 14 

number of clusters K 



A(K) = arctan [logSSR(K - 1) - logSSR(K)] 



arctan [log SSR(K) - log SSR(K + 1)] 



(29) 



Figure 4. Estimating the optimal number of clusters for 
the data sample shown in Fig. [6j 

(a) SSR(i^) as a function of the number K of clusters, (b) Mean 
angular changes (A(K)) averaged over ten fits. 



part of any clustering analysis. It is essential to estimate 
the optimal number of clusters with due caution. This is 
a problem of assessing nonlinear models and there are no 
theoretically justified methods, there are only heuristic ap- 
proaches. Common heuristics are the Bayesian information 
criterion 



BIC = -2 log£+p log TV 

and Akaike's information criterion 

AIC = -2 log£ + 2p, 



(26) 



(27) 



where p and N denote the number of model parameters 
and the nu mber of data points, respectively. As we have 
seen in Sect. 3.5, the bipartite-graph model involves K(N-\- 
1) model parameters. Consequently, BIC and AIC are not 
applicable, since log£ is not able to compensate for the 
large impact of the penalty terms. This inability of log C is 
likely to originate from the sparse data population in the 
high-dimensional parameter space. Another tool of model 
assessment is cross-validation, but this is computationally 
infeasible in this case. 

We now explain how to compare bipartite-graph models 
of different complexities heuristically, i.e. how to estimate 
the optimal number of clusters. This heuristic employs the 
sum of squared residuals 



SSR(iT) 




Wmr> 



(28) 



The definition puts equal emphasis on all elements. If we 
left out the denominator in Eq. (28), the SSR would em- 
phasise deviations of elements witn large values, whereas 
elements with small values would be neglected. However, 



As K is an integer, logSSR(if) is a polygonal chain and 
thus an angular change is well defined. A large positive 
angular change then indicates the presence of a kink in 
SSR(K)Q However, we can even do better by fitting the 
similarity matrix several times for each K and averaging 
the angular changes. The results of the fits differ slightly, 
since the model parameters are randomly initialised each 
time. These mean angular changes are shown in panel (b) 
of Fig. [4j averaged over 20 fits for each K. First, for large 
K the mean angular changes are consistent with zero, i.e. 
in this domain increasing K decreases SSR(if) but does 
not improve the fit systematically. Second, for K = 3 and 
K = 6 the mean angular changes deviate significantly from 
zero. For K = 2 and K = 4, the mean angular changes are 
negative, which corresponds to "opposite" kinks in the SSR 
spectrum and is due to K = 3 being a very good grouping 
of the data. 

For large K these detections may be less definite due to 
the flattening of SSR(lT). Therefore, we may systematically 
underestimate the optimal number of clusters. Moreover, 
this toy example also demonstrates that there may be more 
than a single advantageous grouping of the data and there 
may be disadvantageous groupings. If there are multiple 
detections of advantageous groupings, it may be difficult to 
judge which grouping is the best. In the worst case, we even 
may not find any signal of an advantageous grouping, which 
would either imply that our given sample is composed of 
objects of the same type or that the data does not contain 
enough information about the grouping. Unfortunately, this 
scheme of estimating the optimal number of clusters is ex- 
tremely inefficient from a computational point of view. This 
is a severe disadvantage for very large data sets. Moreover, 
though this heuristic is working well, the significance of the 
mean angular changes is likely to be strongly influenced by 
the variance caused by the algorithm's initialisation. 



1 It is not possible to compute the angular change for K = 1, 
but this case is not a reasonable grouping anyway under the 
assumption that there are objects of different types in the given 
data sample. 
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3. 7. Comparison with previous work 



As the work of |Kelly fc McKay | ( |2004[ |2005| ) is very close 
to our own work, we want to discuss it in some detail 
and work out the differences. The authors applied a soft 
clustering analy s is to the first data release of SDSS. In 
Kelly & McKay (2004) they decomposed r-band images of 
3,0 37 galaxies into shapelets, using the IDL shapelet code 



by |Massey fc Refregier| ( |2005p . In |Kelly fc McKay| ( |2005 ) 
they extended this scheme to all five photometric bands 
u,g,r,i,z of SDSS, thereby also taking into account colour 
information. Afterwards, they used a principal component 
analysis (PCA) t o reduce the dimensio nality of their pa- 
rameter space. In Kelly & McKay ( 20 04[) the reduction was 
from 91 



to 9 dimensions and in |Kelly fc McKay| (|2005j) 
from 455 to 2 dim ensions. Then they fitted a mixture-of- 
Gaussians model (|Bilmes 1997) to the compressed data, 



where each Gaussian component represents a cluster. They 
were able to show that the resulting clusters exhibited a 
reasonable correlation to the traditional Hubble classes. 

Reducing the parameter space with PCA and also using 
a mixture-of-Gaussians model are both problematic from 
our point of view. First, PCA relies on the assumption 
that those directions in parameter space that carry the 
desired information do also carry a large fraction of the 
total sample variance. This is neither guaranteed nor can it 
be tested for in practice. Second, galaxy morphologies are 
not expected to be normally distributed. Therefore, using 
a mixture-of-Gaussians model is likely to misestimate the 
data distribu tion. Nonetheless, the work by |Kelly fc Mc Kay 
([20041 [2005) was a landmark, both concerning their usage 
of a probabilistic algorithm and conceptually, by applying 
a clustering analy sis to the first data release of SDSS. 

In contrast to |Kelly fc McKay| ( |2004[ |2005[ ), we do not 
reduce the dimensionality of the parameter space and then 
apply a clustering algorithm to the reduced data. We also 
do not try to model the data distribution in the parameter 
space, which would be virtually impossible due to its high 
dimensionality (curse of dimensionality, cf. Bell man|1961[ ). 
Rather, we use a similarity matrix, which has two major 
advantages: First, we do not rely on any compression tech- 
nique such as PCA. Second, we cannot make any mistakes 
by choosing a potentially wrong model for the data distri- 
bution, since we model the similarity matrix. There are two 
sources of potential errors in our method: 



1. 



2. 



Estimation of pairwise similarities (Eq. (|6|). This is 
hampered by our lack of knowledge about the metric 
in the morphological space and it is in some sense sim- 
ilar to mismodelling. 

Modelling the similarity matrix by a bipartite-graph 
model. As the only assumption in the derivation of the 
bipartite-graph model is Eq. (15), this happens if and 



only if a significant part of the pairwise similarity is not 
induced by the clusters, but rather by e.g. observational 
effects. However, any other classification method (auto- 
mated or not) will have problems in this situation, too. 



4. Systematic Tests 

In this section we conduct systematic tests using artificial 
data samples that are specifically designed to investigate 
the impact of certain effects. First, we demonstrate that 
hard classification schemes cause problems with subsequent 



parameter estimation. Furthermore, we investigate the im- 
pact of non-optimal similarity measures, two-cluster sepa- 
ration, noise and cluster cardinalities on the clustering re- 
sults. 



4.1. Overview 

We start by describing the artificial data sets that we are 
going to use. Furthermore, we describe the diagnostics by 
which we assess the performance of the clustering algo- 
rithm. 

The data sets are always composed of two clusters, 
where the number of example objects drawn from each clus- 
ter may be different. The clusters are always designed as 
p-variate Gaussian distributions, i.e. 



prob(cc|/L£, E) = 



exp 



y / (27r) p det E 



(30) 



where /j, and E denote the mean vector and the covariance 
matrix, respectively. 

Knowing the true analytic form of the underlying prob- 
ability distributions, we are able to assess the probabilistic 
data-to-cluster assignments proposed by the clustering al- 
gorithm. For two clusters A and B, the true data-to-cluster 
assignment of some data point x to cluster k = A, B is 
given by the cluster posterior 



prob(/c|cc) 



prob(x|^ fe ,E fe ) 



prob(#|^ A ,£U) +prob(#|/Lfc £ >,E! £ >) ' 



(31) 



The numerator prob(cc|/x/ c , £&) is the cluster likelihood. The 
cluster priors prob(A) = prob(£>) = \ are flat and cancel 
out. For a given data set of N objects, these true cluster 
posteriors are compared to the clustering results using the 
expectation values of the zero-one loss function 



1 



N 



O prob fit (C n |cc n ) 

> prob fit (^C n |# n ) 



1 else 



and of the squared-error loss function 



1 N 

( £ SE) = x (prob fit (C n |;r n ) - prob true (C n |a; n ))' 



(32) 



,(33) 



where C n denotes the correct cluster label of object x n and 
-iC n the false label. The zero-one loss function is the mis- 
classification rate, whereas the squared-error loss function is 
sensitive to misestimations of the cluster posteriors that do 
not lead to misclassifications. As the two clusters are usu- 
ally well separated in most of the following tests, the true 
maximum cluster posteriors are close to 100%. Therefore, 
misestimation means underestimation of the maximum pos- 
teriors, which is quantified by yj 

4.2. impact of hard cuts on parameter estimation 

In this first test, we want to demonstrate that hard cuts 
that are automatically introduced when using hard classifi- 
cation or clustering algorithms can lead to systematic mis- 
estimations of parameters, i.e. biases. This is a general com- 
ment in order to support our claim that hard data-to-class 
assignments are generically inappropriate for overlapping 
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Figure 5. Break-down of hard classifications in case of 
overlapping clusters. 

Deviation [la — Ha of estimated and true means vs. two-cluster 
separation Ax for class A for hard estimator (red line), soft 
estimator (blue line), and predicted bias of hard estimator for 
Ax — > (dashed line). From 1,000 realisations of data samples 
we estimated errorbars, which are shown but too small to be 
visible. 



classes. We are not yet concerned with our soft-clustering 
algorithm. We use two one-dimensional Gaussians with 
means ha and variable two-cluster separation Ax = 
Ha — ^Bi and constant variance a 2 = 1. From each Gaussian 
cluster we then draw N =10,000 objects. From the resulting 
data sample we estimate the means [ik of the two Gaussians 
and compare with the true means jik- The results are aver- 
aged over 1,000 realisations of data samples. 

Figure [5] shows the deviations of the estimated from the 
true means when using a hard cut at x = (red line) and a 
weighted mean (blue line). A hard cut at x = that assigns 
all data points with x < to class A and those with x > 
to class B is the most reasonable hard classification in this 
simple example. Once the complete sample is divided into 
two subsamples for classes A and B, we estimate the usual 
arithmetic mean 



-hard 



1 N k 

— X k,n • 



(34) 



As Fig. [5] shows, this estimator is strongly biased in case 
of overlapping clusters (Ax — >• 0). In the limit of Ax = 0, 
we can predict this bias analytically from the expectation 
value 



(x)a/b = T2 r dx x e- 2 / 2 = -^f= « T 0. 



7979, 



(35) 



where the integration is only over one half of the parameter 
space and the factor of 2 arises from both Gaussians con- 
tributing the same for Ax = 0. This bias is shown as dashed 
line in Fig. [5) where for Ax = also /jLa/b = ^Ax/2 — 0. If 
we employ the true posteriors defined by Eq. ( [31] ) as weights 
and use 



Al oft 



^^ =1 prob(fe|x n )x ? 

En=lP r ° b (^K) 



(36) 
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Figure 6. Artificial data sample with six clusters (top) and 
the matrix of pairwise Euclidean distances (bottom). 
Each cluster has an underlying bivariate Gaussian distribution 
with covariance matrix E = diag(l,l). We sampled 50 data 
points from each cluster. 



then we will get an unbiased estimate despite the overlap, 
as is evident from Fig. |5| This comparison demonstrates 
the break-down of hard algorithms in case of overlapping 
clusters. 



4.3. Impact of non-optimal similarity measures 

We now explain how to optimise the similarity measure 
defined in Eq. (|6| and what "optimal" means. Given the 
N x N symmetric matrix of pairwise distances <i(a? m ,a? n ), 
we can tune the similarity measure by adjusting the two 
parameters a and s. Tuning the similarity measure has to 
be done with care, since there are two undesired cases: First, 
for a — >• 00, the resulting similarity matrix approaches a 
constant, i.e. W mn = ^2 for all elements, since d(x m , x n ) < 
drnax- This case prefers K = 1 clusters, independent of 
any grouping in the data. Second, for a — > 0, the similarity 
matrix approaches the step matrix defined by 



{1 



<^> m = n 
^ rn 7^ n 



N 



(37) 



1. This case 



which is normalised such that ^ 
prefers K = N clusters. The optimal similarity measure 
should be as different as possible from these two worst cases. 
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Figure 7. Estimating the optimal similarity measure for 
the example data of Fig. [6| 

Top panel: Modified Manhattan distance C (Eq. ( |38| )) for s = 
1.01 (cyan line), s — 1.03 (blue line) and s — l.l(red line). 
For a — >• the matrix becomes a step matrix, which is why the 
constant levels depend on the scale parameter. Bottom panel: 
The resulting similarity matrix. 



We choose a and s such that the modified Manhattan dis- 
tance to the constant matrix 



N 



C =EE 



1 

JV2 



(38) 



is large. Figure [7] demonstrates how to tune the similarity 
measure using the artificial data set from the toy exam- 
ple of Fig. [6j The basis is the N x N symmetric distance 
matrix shown in the bottom panel of Fig. [6j For three dif- 
ferent values of s, the top panel shows C as functions of 
a. Obviously, C(a) exhibits a maximum and can thus be 
used to choose a. For s = 1.1 (red curve) the maximum is 
lowest and so is the distance to a constant matrix, s = 1.01 
exhibits the maximum deviation from a constant matrix, 
but this choice of s downweights off-diagonal terms in W 
according to Eq. (37). Thus, we also prefer if s is not too 
close to 1 and s = 1.03 (blue curve) is the compromise of 
the three scale parameters shown in Fig. [7] Note that the 
choice of s is not an optimisation but a heuristic. Although 
the artificial data set of Fig. [6] and its distance matrix are 
very special, we experienced that C(a) as shown in Fig. [7] 
is representative for the general case. 



The resulting similarity matrix is shown in the right 
panel of Fig. [7] and exhibits a block-like structure, since we 
have ordered the data points in the set. This is just for the 
sake of visualisation and does not affect the clustering re- 
sults. We clearly recognise six blocks along the diagonal, be- 
cause the within-cluster similarities are always larger than 
the between-cluster similarities. Furthermore, we recognise 
a large block of four clusters in the bottom right corner that 
are quite similar to each other, whereas the remaining two 
clusters are more or less equally dissimilar to all other clus- 
ters. Consequently, the similarity matrix indeed represents 
all the features of the data set shown in Fig. [6] 

We now demonstrate first that the optimal similarity 
measure indeed captures the crucial information on the 
data and what happens if we do not use the optimal similar- 
ity measure. We use an artificial data set composed of two 
one-dimensional Gaussian clusters, both with unit variance 
and two-cluster separation of Ax = 3. We sample 100 ex- 
ample objects from each cluster and compute the matrix of 
pairwise distances using the Euclidean distance measure. 
This data set and its distance matrix remain unchanged. 
For a constant parameter s = 2.25, we vary the exponent a 
in the similarity measure denned by Eq. (p| . For each value 
of a, we fit bipartite-graph models with K = 1, 2 and 3 to 
the resulting similarity matrix, averaging the results over 
15 fits each time. 

Results of this test are shown in Fig. [8] Panel (a) shows 
the modified Manhattan distance C to a constant matrix. 
This curve is very similar to Fig. [7] There is a prominent 
peak at a ~ 0.6, indicating the optimal similarity measure. 
If the similarity measure is very non-optimal, then the sim- 
ilarity matrix will be close to a constant or step matrix, 
i.e. it poorly constrains the bipartite-graph model. In this 
case, we expect to observe overfitting effects, i.e. low resid- 
uals of the reconstruction and results with high variance. 
The computation times are longer, too, since the nonlinear 
model parameters can exhibit degeneracies thereby slowing 
down the convergence. Counterintuitively, we seek a large 
value of SSR in this test, since a similarity matrix which 
captures well the information content of the data is harder 
to fit. Indeed, the SSR values shown in panel (c) of Fig. [8] 
are significantly lower for non-optimal a's and peak near 
the optimal a. As expected, the mean computation times 
shown in panel (b) are minimal for the optimal similarity 
measure. Panel (d) shows how the evidence for two clus- 
ters evolves0Near the optimal a, also the evidence for two 
clusters shows a local maximum. The misclassification rate 
shown in panel (e) is insensitive to a over a broad range, but 
approaches a rate of 50% rather abruptly for extremely non- 
optimal similarity measures. The squared-error loss shown 
in panel (f) is more sensitive to non-optimalities. It exhibits 
a minimum for the optimal a and grows monotonically for 
non-optimal values. 

The most important conclusion to draw from this test is 
that our method of choosing a and s for the similarity mea- 
sure denned in Sect. |3.1| is indeed "optimal", in the sense 
that it minimises both the misclassification rate and the 
squared-error loss. Additionally, we see that using the opti- 
mal similarity measure can also reduce computation times 
by orders of magnitude. 



2 This is the reason why we need to fit bipartite-graph models 
using K — 1, 2 and 3, in order to compute the angular change 
of SSR(iT) at K = 2. 
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Figure 8. Impact of non-optimal similarity measures on clustering results. 

(a) Modified Manhattan distance C (Eq. (38)). (b) Mean computation time per fit without errorbars. (c) Mean SSR(if) values 
of resulting fits for K — 2. (d) Mean angular change of SSR(if) at K — 2. (e) Mean misclassincation rate (solid line) and 50% 
misclassincation rate (dashed line), (f) Mean squared-error loss of maximum cluster posteriors. 



4.4. Impact of two-cluster overlap 

As we have to expect overlapping clusters in the context of 
galaxy morphologies, we now investigate the impact of the 
two-cluster overlap on the clustering results. The data sets 
used are always composed of 100 example objects drawn 
from two one-dimensional Gaussian clusters, both with unit 
variance. The two-cluster separation Ax is varied from 1 to 
1000. For each data set, we compute the matrix of pair- 
wise Euclidean distances and then automatically compute 
the optimal similarity matrix by op timi sing a using a con- 
stant s = 2.25 as described in Sect. |4.3| To each similarity 
matrix we fit bipartite-graph models with K = 1, 2 and 
3 clusters. Furthermore, we fit a if-means algorithm with 
K = 1,2 and 3 to each data set in order to compare the re- 
sults of both clustering algorithms. For each configuration, 
the results are averaged over 50 fits. 

Results of this test are summarised in Fig. M Panel (a) 
shows the mean evidence for two clusters, based on the an- 
gular changes in SSR(if ) for the bipartite-graph model and 
the within-cluster scatter for the if -means algorithm. For 
decreasing separation Ax, i.e. increasing overlap, the evi- 
dence for two clusters decreases for both algorithms, as is 
to be expectedj^] As panel (b) reveals, the misclassincation 
rates for if-means and the bipartite-graph model are both 
in agreement with the theoretically expected misclassifica- 
tion rate expected in the ideal case (black curve). For two 



one-dimensional Gaussians with means =L^r, the theoreti- 
cal misclassification rate is given by 

r 



/ r theo\ 

/ 



dx prob 



Ax 
~~2~ 



(39) 



3 Note that these two curves cannot be compared directly. 
Their agreement for Ax < 20 is coincidence. 



which measures the overlap of both Gaussians. In the limit 

Ax = 0, this yields (^o^ eo ) = \. The explanation for the 
excellent performance of both if -means and bipartite-graph 
model is, that in this case the clusters have equal cardi- 
nalities and are spherical. Nevertheless, the results of the 
if-means are biased due to the hard data-to-cluster assign- 
ment. Panel (c) of Fig. [9] shows the mean squared-error 
loss of the bipartite-graph models First, the general trend 
is that the squared-error loss increases for decreasing two- 
cluster separation. This is due to the growing amount of 
overlap confusing the bipartite-graph model. Second, for 
Ax < 4, the squared-error loss decreases significantly. This 
effect can be explained as follows: For very small separa- 
tions, the overlap is so strong that even the true cluster 
posteriors are both close to 50%. Therefore, the fitted clus- 
ter posteriors scatter around 50%, too, thereby reducing the 
squared error. Third, the squared error establishes a con- 
stant value of ~ 0.045 at large separations. In this 
case, the true maximum cluster posteriors are essentially 
100%, so this corresponds to a s ystema tic underestimation 
of the maximum posteriors of vX^SE)) ~ 21%. Due to the 

4 We do not compare with if-means, since if-means is a hard 
algorithm and squared-error loss is no reasonable score function 
in this case. 
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Figure 9. Impact of two-cluster overlap on clustering re- 
sults for if -means algorithm and bipartite-graph model, 
(a) Mean angular change of SSR(if ) (bipartite-graph model) 
and within-cluster scatter (if -means) at if — 2. (b) Mean mis- 
classification rates of if-means and bipartite-graph model (see 
text) compared to theoretical prediction. All curves coincide, (c) 
Mean squared-error loss of bipartite-graph model. 



Figure 10. Impact of noise variance on clustering results 
for if-means algorithm and bipartite-graph model, 
(a) Mean angular change of SSR(if) (bipartite-graph model) 
and within-cluster scatter (if -means) at if — 2. (b) Mean mis- 
classification rate of if -means and bipartite-graph model, (c) 
Mean squared-error loss of bipartite-graph model. 



large two-cluster separation, this bias does not lead to mis- 
classifications, as is evident from panel (b) in Fig. [91 This 
bias originates from the fact that any two objects nave a 
finite distance and thus a non- vanishing similarity. 

This test further demonstrates that the bipartite-graph 
model yields convincing results. This is most evident in the 
misclassification rate, which is in excellent agreement with 
the theoretical prediction of the best possible score. 



4.5. Impact of noise 

As observational data is subject to noise, we now investi- 
gate the response of the clustering results to noise on the 
similarity matrix. We simulate the noise by adding a sec- 
ond dimension y to the data. The two clusters are bivariate 
Gaussian distributions, both with cr^ = 1 and two-cluster 
separation of Ax = 10 and Ay = 0. We vary the size of 
the variance in ^-direction ranging from a 2 y = 0.1 to 10000, 
thereby introducing noise that translates via the Euclidean 
distance to the similarity matrix. From each cluster 100 
example objects are drawn and we fit bipartite-graph and 
if -means models using if = 1, 2 and 3. The results are 
averaged over 50 fits for each value of a^. 



Results of this test are shown in Fig. 10 The evidence 
for two clusters (panel (a)) rapidly degrades for increasing 
variance for the bipartite-graph model as well as the if- 
means algorithm, as is to be expected. Inspecting the mis- 
classification rate (panel (b)) reveals that both algorithms 
are insensitive to crl until a critical variance is reached 
where both misclassification rates increase abruptly. For the 
if- means algorithm, this break down happens at Cy ~ 30, 
whereas the bipartite- graph model breaks down at ^40, 
which amounts to ^ « 1.6 in this setup. The evidence for 

two clusters (panel (a)) rises again for larger variances, al- 
though both algorithms have already broken down. This 
is a geometric effect: With increasing <j^, the two clus- 
ters become more extended in ^/-direction, until it becomes 
favourable to split the data along x = rather than y = 0. 
This also explains why the misclassification rate is 50% in 
this regime. Consequently, the abrupt break-down origi- 
nates from the setup of this test. Sampling more objects 
from each cluster might have prevented this effect, but 
would have increased the computational effort drastically. 
Moreover, this also demonstrates that isotropic distance 
measu res are problematic. Using e.g. a diffusion distance 
(e.g. |Richards et al.| 2009) may solve this problem. The 
break down is less abrupt in the mean squared-error loss 
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(panel (c)), since (Cse) is also sensitive to posterior mises- 
timation that do not lead to misclassifications. 

We conclude that the bipartite-graph model is fairly 
insensitive to noise of this kind over a broad range, until 
the setup of this test breaks down. 

4.6. Impact of cluster cardinalities 

Typically different types of galaxy morphologies have dif- 
ferent abundancies in a given data sample. For instance, 
Bamford et al. (2009) observe different type fractions of 
early- type and spiral galaxies in the Galaxy Zoo project. 
Therefore, we now investigate how many objects of a cer- 
tain kind are needed in order to detect them as a cluster. 
The concept of a number of objects being members of a cer- 
tain cluster is poorly defined in the context of soft cluster- 
ing. We generalise this concept by defining the cardinality 
of a cluster Ck 



N 

card(/c) = prob(ck\x n ) . 



n=l 



(40) 



This definition satisfies Ylk=i car d(&) = since the clus- 
ter posteriors are normalised. In the case of hard clustering, 
Eq. (40 ) is reduced to simple number counts, where the clus- 



ter posteriors become Kronecker symbols. We use two clus- 
ters, both one-dimensional Gaussians with unit variance 
and a fixed two-cluster separation of Ax = 10. We then 
vary the number of objects drawn from each cluster such 
that the resulting data set always contains 200 objects. For 
each data set, we compute two different similarity matrices: 
First, we compute the similarity matrix using the optimal 
a for a constant s = 2.0, according to the recipe given in 
Sect. |4.3| This similarity measure is adapted to every data 
set (adaptive similarity measure). Second, we compute the 
similarity matrix using a = 0.6 and s = 2.0, which is the 
optimal similarity measure for the data set composed to 
equal parts of objects from both clusters. This similarity 
measure is the same for all data sets (constant similarity 
measure). To each of the two similarity matrices we fit a 
bipartite-graph model using K = 2 and average the results 
over 50 fits. 

The results are summarised in Fig. 11 Panel (a) shows 



the dependence of the misclassification rate on the cardi- 
nality of cluster A. For the adaptive similarity measure 
the bipartite-graph model will break down, if one cluster 
contributes less than 10% to the data set. For the con- 
stant similarity measure it will break down, if one cluster 
contributes less than 3%. The same behaviour is evident 
from the squared-error loss in panel (b). This problem is 
caused by the larger group in the data set dominating the 
statistics of the modified Manhattan distance C defined by 
Eq. (38). This is a failure of the similarity measure, not of 



the bipartite-graph model. The constant similarity measure 
stays "focussed" on the difference between the two clusters 
and its break-down at 3% signals the limit to which clusters 
are detectable with the bipartite-graph model. 

Panel (c) in Fig. ITT] shows the correlation of the mea- 
sured cluster cardinality to the true cluster cardinality. For 
the constant similarity measure, both quantities correlate 
well. In contrast to this, for the adaptive similarity measure 
the two quantities do not correlate at all. Again, the adap- 
tive similarity measure is dominated by the larger group, 
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Figure 11. Impact of cardinalities on clustering results, 
(a) Mean misclassification rate, (b) Mean squared-error loss, (c) 
Correlation of estimated and true cluster cardinality. 



i.e. the similarities between the large and the small group 
are systematically too high. This leads to a systematic un- 
derestimation of the maximum cluster posteriors (cf. panel 
(b)), since for a two-cluster separation of Ax = 10 the true 
posteriors are essentially 100% as shown by Fig. [9fc. This 



also affects the cluster cardinalities defined by Eq. (40). If 
the cluster overlap is stronger, then this bias is likely to 
lead to misclassifications, too. 

We c oncl ude that the optimal similarity measure defined 
in Sect. 4.3 fails to discover groups that contribute 10% 
or less to the complete data sample. A different similarity 
measure may solve this problem, but the optimal similarity 
measure has the advantage of minimising the misclassifi- 
cation rate and the squared-error loss for the discovered 
groups. 



5. Worked Example with SDSS Galaxies 

In this section we present our worked example with SDSS 
galaxies. First, we describe the sample of galaxies we anal- 
yse. Before applying the bipartite-graph model to the whole 
sample, we apply it to a small subsample of visually clas- 
sified galaxies to prove that it is working not only for sim- 
ple simulated data but also for real galaxy morphologies. 
Again, we emphasise that this is just meant as a demonstra- 
tion, so parametrisation and sample selection are idealised. 
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5.1. The data sample by Fukugita et al. (2007) 



Fukugita et al. (2007) derived a catalogue of 2,253 bright 
galaxies with Petrosian magnitude in the r band brighter 
than Tp = 16 from th e Third Data Release of the SDSS 
( |Abazajian et al.|2005[ ) . We analyse only g-band imaging of 
this sample, which is sensitive to HII regions and spiral arm 
structures. We expect that objects that are bright in r are 
also bright in the neighbouring g-band. Therefore, all these 
objects have a high signal-to-noise ratio, i.e. the shapelet 
decomposition can employ a maximum order sufficiently 
large to reduce possible modelling problems. 

Apart from the g-band imaging data, we also retrieved 
further morphological information from the SDSS database, 
namely Petrosian radii r 5 o and rgo containing 50% and 90% 
of the Petrosian flux, ratios of isophotal semi major and 
semi minor axis, and the logarithmic likelihoods of best- 
fitting de Vaucouleurs and exponential-disk profiles. Given 
the Petrosian radii r$o and rgo containing 50% and 90% 
of the Petr osian flux, we de fine the concentration index in 
analogy to Conselice ( 2003), 



C = 5 loe 



^90 
^50 



(41) 



For compact objects, such as elliptical galaxies, this con- 
centration index is large, whereas it is smaller for extended 
objects with slowly decreasing light profiles, such as disk 
galaxies. 

We then reduce the data sample in three steps: First, 
we sort out peculiar objects, i.e. objects that are definitely 
not galaxies, blended objects and objects that were cut in 
the mosaic. All these objects have no viable galaxy mor- 
phologies. This was done by visual inspection of all objects. 
Second, we decompose all images into shapelets using the 
same maximum order iVmax = 12 (91 expansion coeffi- 
cients) for all objects. The shapelet code performs several 
internal data processing steps, namely estimating the back- 
ground noise and subtracting the potentially non-zero noise 
mean, image segmentation and masking of multiple objects, 
estimating the object centroid position (cf. |Melchior et al.| 
2007). Third, we sort out objects for which the shapelet 
reconstruction does not provide reasonable models. This is 
done by discarding all objects whose best fits have a re- 
duced x 2 that is not in the interval [0.9,2.0]. The lower 
limit is chosen very close to unity, since shapelets have 
the tendency to creep into the background noise and over- 
fit objects. Setting ou t from the 2,253 bright galaxies of 
Fukugita et al. (2007), the data processing leaves us with 
1,520 objects with acceptable x 2 - We check that the mor- 
phological information contained in the original data set 
and the reduced data set does not differ systematically, by 
comparing the sample distributions of Petrosian radii, axis 
ratios, concentration indeces, and logarithmic likelihoods 
of best-fitting deVaucouleur and exponential-disk profiles. 
All objects are large compared to the point-spread func- 
tion (PSF) of SDSS, such that a P SF deconvolution as de- 
scribed in |Melchior et al. (2009a) is not necessary. This 
means we analyse apparent instead of intrinsic morpholo- 
gies, but both are approximately the same. 



5.2. Demonstration with three clusters 

In this section we apply the soft clustering algorithm by 
Yu et al. (2005) for the first time to real galaxies. We use 
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number of clusters K 



Figure 12. Mean angular changes (A(K)} of bipartite- 
graph model for data set composed of edge-on disks, face-on 
disks and ellipticals. 



a small data set of 84 galaxies, which we visually classi- 
fied as edge-on disk, face-on disk or ellipticals (28 objects 
per type). As these 84 galaxies were very large and very 
bright, we decomposed them anew using a maximum or- 
der of iVmax = 16, resulting in 153 shapelet coefficients 
per object. Figure [I] shows one example object and its 
shapelet reconstruction for each type. This data set exhibits 
a strong grouping and we demonstrate that the bipartite- 
graph model indeed discovers the edge-on disks, face-on 
disks and ellipticals automatically, without any further as- 
sumptions. 

The estimation of the number of clusters is shown in Fig. 
12 The mean angular changes in SSR(if) averaged over 
20 fits indeed reveal only one significant kink at K = 3. 
The lowest value of SSR at K = 3 is SSR « 48, which 
corresponds to an RMS residual (cf. Eq. ([28]) ) of 



SSR 



\N(N- 



1) 



11.6% 



(42) 



The denominator ^N(N + 1) is the number of independent 
elements in the symmetric similarity matrix. 

We conclude from Fig. [12] that the bipartite-graph 
model indeed favours three clusters. However, we still have 
to prove that the similarity matrix contains sufficient infor- 
mation on the data and that the bipartite-graph model dis- 
covers the correct classes. For K = 3, the cluster posteriors 
populate a two-dimensional plane because they are subject 
to a normalisation constraint. This plane is shown in Fig. 
p~3| Indeed, the distribution of cluster posteriors exhibits an 
excellent grouping of ellipticals, edge-on disks and face-on 
disks. The three clusters are well separated, apart from two 
objects labelled as edge-on disks but assigned to the cluster 
of ellipticals. A second visual inspection of these two "out- 
liers" revealed that we had initially misclassified them as 
edge-on disk. The excellent results are particularly impres- 
sive if we remember that we analysed 84 data points dis- 
tributed in a 153-dimensional parameter space. Moreover, 
it is very encouraging that the soft clustering analysis did 
indeed recover the ellipticals, face-on and edge-on disks au- 
tomatically. 

In order to get an impression of how good these results 
actually are, we compare the cluster posterior plane to re- 
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Figure 13. Cluster posterior space of bipartite-graph 
model for edge-on disks, face-on disks and ellipticals. 
The triangle defines the subspace allowed by the normalisation 
constraint of the posteriors. The corners of the triangle mark 
the points of 100% posterior probability. The * indicates the 
point where all three posteriors are equal. Colours encode a- 
priori classifications unknown to the algorithm. 
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Figure 14. Comparing Fig. 



13 



with results of PCA for 



edge-on disks, face-on disks and ellipticals. 
Parameter space spanned by the first two principal components. 
The first principal component carries ~ 45.2% and the second 
« 21.4% of the total variance. Colours encode a-priori classifi- 
cations unknown to the PCA algorithm. 



suits obtained from PCA. Therefore, we estimate the co- 
variance matrix E of the data sample in shapelet-coefncient 
space and diagonalise it. Only the first 83 eigenvalues of E 
are non-zero, since the 84 data objects poorly constrain the 
153 x 153 covariance matrix. The first two principal compo- 
nents carry 66.6% of the total sample variance and Fig. 14 
displays the parameter space spanned by them. Obviously, 
PCA performs well in reducing the parameter space from 
153 dimensions down to two, since the ellipticals, face-on 
and edge-on disks exhibit a good groupingjj However, the 
bipartite-graph model provides much more compact and 
well-separated groups. This is due to the degeneracies we 
have broken when we computed the minimal spherical dis- 



5 PCA only reduces the parameter space, but does not assign 
classes to objects. 
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Figur e 15. Est imating the number of clusters in the data 
set of |Fukugita et al.| ( |2007| ). 

Mean angular changes (A( K)) are averaged over 15 Fits. 



tances as described in Sect.[2j In case of PCA, these degen- 
eracies are unbroken and introduce additional scatter. 

In both Figs. [13] and [14] we notice that the group of ellip- 
ticals is significantly more compact than the groups of face- 
on and edge-on disks. This is caused by three effects: First, 
as discussed in Sect. [2j our parametrisation of elliptical 
galaxies is problematic, thereby introducing common arte- 
facts for all objects of this type. These common features are 
then picked up by the soft-clustering algorithm. Ironically, 
the problems of the parametrisation help to discriminate 
the types in this case. Second, we described in Sect.[2]how to 
make our morphological distance measure invariant against 
various random quantities, namely image size, image flux, 
orientation angle and handedness. However, the distance 
measure and thereby the similarity measure are not invari- 
ant against the inclination angle w.r.t. the line of sight, 
which introduces additional scatter into the clustering re- 
sults. We expect that the impact of this random effect is 
smaller for ellipticals than for disk galaxies. Third, disk 
galaxies usually exhibit complex substructures (e.g. spiral 
arms or star-forming regions), whereas elliptical galaxies 
do not. Consequently, the intrinsic morphological scatter 
of disk galaxies is larger than for ellipticals. 

5.3. Analysing the data set of Fukugita et al. (2007) 

We now present the soft-clustering results from analysing 
all 1,520 bright gala xies from the reduced data set of 
Fukugita et al. ( 2QQ7[ ). We have chosen the similarity mea- 
sure with s = 1. 02 a nd corresponding optimal a « 0.12, 
according to Sect. 4.3 The shapes of the curves of the modi- 
fied Manhattan distances C(a) are of the same generic form 
as before. Fit results of the similarity matrix for K rang- 
ing from 1 to 20 are shown in Table [2] and Fig. [15] There 
are significant deviations of the mean angular changes from 
zero for K = 3 and K = 8. The signal at K = 2 is ignored, 
since the SSR value is very high (cf. Table [2). 

First, we investigate the clustering results for K = 3, 
where we have SSR « 6, 146 (cf. Ta ble [2| corresponding to 
an RMS residual of « 3.7% (cf. Eq. (|42f) for the similarity- 
matrix reconstruction. In Fig.[l6]we snow the top five exam- 
ple objects for each of the three clusters together with a his- 
togram of the distribution of cluster posteriors. Inspecting 
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K 


minimal SSR 


mean angular changes (degrees) 


1 


39, 220 




2 


12, 313 


14.489 ± 0.047 


3 


6, 146 


22.67 ± 0.14 


4 


4, 965 


—2.01 ± 0.19 


5 


3, 868 


2.76 ± 0.27 


6 


3, 155 


2.19 ± 0.61 


7 


2, 676 


—0.89 ± 1.17 


8 


2, 254 


5. 91 ± 0.69 


9 


2, 093 


—0.18 ± 0.95 


10 


1, 931 


A Of 1 111 

—0.35 ± 1.11 


11 


1, 790 


0.24 ± 0.86 


12 


1, 661 


1 c\ l a rA 

1.92 ± 0.52 


13 


1, 593 


0.36 ± 0.35 


14 


1, 532 


r\ no 1 /~v r7 o 

0.03 ± 0.73 


10 




n 1 c; _|_ n QzL 


16 


1,430 


0.86 ±0.71 


17 


1,405 


0.04 ± 0.43 


18 


1,383 


0.22 ±0.39 


19 


1,365 


0.14 ±0.34 


20 


1,348 





Table 2. Fitting the similarity matrix of 1,520 objects. 
We present the minimal SSR value out of 15 fits and the 
mean angular change averaged over 15 fits. 



the example images, we clearly see that the first cluster is 
obviously composed of face-on disk galaxies, whereas the 
second cluster contains ellipticals. The third cluster is the 
cluster of edge-on disk galaxies or disks with high inclina- 
tion angles. However, a blended object has been misclassi- 
fied into this cluster, too. There are still some blended ob- 
jects left that we failed to remove, since when sorting out 
blended objects we visually inspected the images in reduced 
resolution. The cluster posteriors for K = 3 are very infor- 
mative: First, we notice that objects from cluster 1 have 
typically very low posteriors in cluster 2 and intermediate 
posteriors in cluster 3, i.e. face-on disks are more similar 
to edge-on disks than to ellipticals. Second, objects from 
cluster 2 have low posteriors in all other clusters. Third, 
objects from cluster 3 tend to be more similar to objects in 
cluster 2, i.e. edge-on disks are more similar to ellipticals. 
This is probably due to the higher light concentration and 
steep light profiles. 

These results demonstrate that the clustering analy- 
sis indeed yields reasonable results for realistic data sets. 
Furthermore, the results for three clusters are very similar 
to the clustering scheme of Sect. |5.2| However, three clus- 
ters are not enough to describe the data faithfully. This is 
evident from the much larger SSR value for K = 3 com- 
pared to K = 8 and from Fig. [17| where we show the re- 
sulting cluster posterior space for K = 3. Large parts of the 
available posterior space remain empty whereas the central 
region is crowded. This behaviour is due to the lack of com- 
plexity in the bipartite-graph model and strongly suggests 
that more clusters are necessary. 

For K = S we have SSR « 2,254 (cf. Table [§, 
which corresponds to an RMS residual of « 2.2% for the 
similarity-matrix reconstruction. We show ten top example 
objects for each cluster in Fig. 18 First, we notice that the 
resulting grouping is excellent. However, it is difficult to 
understand the differences between some clusters. Clusters 
1 and 5 are obviously objects with high ellipticities, e.g. 
edge-on disks, but what is their difference? Is it the bulge 
dominance which is much weaker in cluster 1 than in 5? 
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Figure 16. Top example objects for K = 3 clusters. 
Each row corresponds to a cluster. We also show the distribution 
of its cluster posteriors beneath each object. Cluster 1 seems to 
contain face-on disks, cluster 2 compact objects, and cluster 3 
edge-on disks. 
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Figure 17. Cluster posterior space for K = 3. 
Projected cluster posteriors are displayed 10% translucent such 
that their density becomes visible. See Fig.^Jfor an explanation 
of the topology of this plot. 



Do the clusters differ in their radial light profiles? What 
is the difference between clusters 2 and 7 which are both 
face-on disks? Of particular interest are clusters 3 and 8, 
where both seem to contain roundish and compact objects. 
However, the posterior histograms reveal a highly asym- 
metric relation: Objects from cluster 3 also prefer cluster 
8 above all other clusters. Nevertheless, most of the top 
examples of cluster 8 have extremely low posteriors in clus- 
ter 3, i.e. association with cluster 3 is highly disfavoured. 
Although we cannot explain this result without further in- 
vestigation, it is interesting that the algorithm picked up 
such a distinctive signal. 

As we have access to the isophotal axis ratio and the 
concentration index (cf. Eq. (41)) for all objects, we inves- 
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Figure 18. Top example objects for K = 8 clusters. 

Each row corresponds to a cluster. For each object, we also show the histogram of the distribution of its cluster posteriors beneath 
it. The objects were aligned in shapelet space, not in real space. 
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Figure 19. Mean axis ratios and concentration indices for 



the clusters of Fig. 18 



Weighted means were computed from the top 100 example ob- 
jects of each cluster. We show the contours of la and take into 
account possible correlations. 



tigate their distributions for the clusters. Figure [19] shows 
the mean axis ratios and the mean concentration indices for 
all eight clusters averaged over the 100 top examples. The 
cluster with the highest mean axis ratio is cluster 1, which 
is the cluster of edge-on disk galaxies. The cluster with low- 
est concentration index is cluster 7, which is the cluster of 
face-on disk galaxies that exhibit extended smooth light 
profiles. Clusters 3, 4, 5 and 8 have the largest concentra- 
tion indices. As is evident from Fig. 18j these clusters are 
indeed composed of rather compact objects that seem to be 
elliptical galaxies. However, there is no decisive distinction 
in Fig. [19] This is not necessarily a flaw in the clustering 
results, but rather more likely caused by concentration and 
axis ratio being an ins ufficient parametrisation scheme (cf. 



Andrae et al. 



m prep. 



It seems like the resulting classification scheme is essen- 
tially face-on disk, edge-on disk and elliptical. If we increase 
the number of clusters, we get further diversification that 
may be caused by bulge dominance or inclination angles. 
We emphasise again that our primary goal is to demon- 
strate that our method discovers morphological classes and 
provides data-to-class assignments that are reasonable. 

6. Conclusions 

We briefly summarise our most important arguments and 
results: 



Galaxy evolution, the process of observation, and 
the experience with previous classification attempts 
strongly suggest a probabilistic ("soft") classification. 
Hard classifications appear to be generically inappro- 
priate. 

There are two distance-based soft-clustering algorithms 
that have been applied to gal axy morphologies s o far: 
Gaus sian mixture models by Kelly fc McKay ([2004, 
2005) and the bipartite-graph model by Yu et al. (12005) 



presented in this work. The weak points of the Gaussian 
mixture model are the dimensionality reduction and 
its assumption of Gaussianity. The weakness of the 
bipartite-graph model is the definition of the similar- 
ity measure. 



— The shapelet formalism, our similarity measure, and the 
bipartite-graph model produce reasonable clusters and 
data-to-cluster assignments for real galaxies. The au- 
tomated discovery of classes corresponding to face-on 
disks, edge-on disks and elliptical galaxies without any 
prior assumptions is impressive and demonstrates the 
great potential of clustering analysis. Moreover, the au- 
tomatically discovered classes have a qualitatively dif- 
ferent meaning compared to pre-defined classes, since 
they represent to grouping that is preferred by the given 
data sample itself. 

— Random effects such as orientation angle and inclination 
are a major obstacle, since they introduce additional 
scatter into a parametrisation of galaxy morphologies. 

— For data sets containing N galaxies, the computation 
times scale as G(N 2 ). Nevertheless, we experienced that 
a clustering analysis is feasible for data sets containing 
up to TV = 10,000 galaxies without employing super- 
computers. We conclude that a clustering analysis on a 
data set of one million galaxies is possible using super- 
computers. 

— It is possible to enhance this method by setting up 
a classifier based on the classes found by the soft- 
clustering analysis, thereby improving the time com- 
plexity from G(N 2 ) to G(N). 

— The method presented in this paper is not limited to 
galaxy morphologies only. For instances, it could pos- 
sibly be applied to automated star-galaxy classification 
or AGN detection. 

The bottom line of this paper is that automatic discov- 
ery of morphological classes and object-to-class assignments 
(clustering analysis) does work and is less prejudiced and 
time-consuming than visual classifications, though the in- 
terpretation of the results is still an open issue. Especially 
when analysing new data samples for the first time, clus- 
tering algorithms are more objective than using pre-defined 
classes and visual classifications. The advantages of such a 
sophisticated statistical algorithm justify its considerable 
complexity. 
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